[Model] Add Index-AniSora I2V support (V1 5B + V2 14B) by dorhuri123 · Pull Request #877 · vllm-project/vllm-omni

dorhuri123 · 2026-01-20T22:58:49Z

Summary

This PR adds support for Index-AniSora Image-to-Video models, a family of anime-optimized video generation models developed by Bilibili. It supports both the 5B (CogVideoX-based) and 14B (Wan2.1-based) variants.

Closes #670

Supported Models

| Model | Architecture | VRAM Required | HuggingFace |
|---|---|---|---|
| AniSora V1 (5B) | CogVideoX | ~24GB | `IndexTeam/AniSora-v1-i2v-diffusers` |
| AniSora V2/V3 (14B) | Wan2.1 | ~65GB | `aardsoul-music/Wan2.1-Anisora-14B` |

Demo Results

AniSora V1 (5B) - RTX 6000

Input Image:

Generation Settings:

Prompt: "A cat playing with yarn"
Resolution: 480 × 720
Frames: 81 frames @ 16fps
Inference steps: 50
Guidance scale: 5.0

Output Video (5.06 seconds):

anisora_v1_demo_gh.mp4

AniSora V2 (14B) - Short - NVIDIA H200

Input Image:

Generation Settings:

Prompt: "a panda eating bamboo, natural lighting, detailed fur"
Resolution: 480 × 832
Frames: 17 frames @ 8fps
Inference steps: 30
Guidance scale: 5.0

Output Video (2.1 seconds):

anisora_v2_output_gh.mp4

AniSora V2 (14B) - Long - NVIDIA H200

Input Image:

Generation Settings:

Prompt: "a woman smiling gently, soft natural lighting, cinematic quality, subtle head movement, flowing hair"
Resolution: 480 × 832
Frames: 49 frames @ 8fps
Inference steps: 30
Guidance scale: 5.0

Output Video (6.1 seconds):

anisora_v2_long.mp4
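
The durations quoted for the three demos are consistent with frame count divided by frame rate. A quick check, as a sketch only (the helper name `clip_seconds` is made up for illustration and is not part of the PR):

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip in seconds (frames divided by frame rate)."""
    return num_frames / fps


# (label, frames, fps) taken from the demo settings listed above
demos = [("V1", 81, 16), ("V2 short", 17, 8), ("V2 long", 49, 8)]

for label, frames, fps in demos:
    print(f"{label}: {clip_seconds(frames, fps):.3f} s")
```

The three values round to the 5.06 s, 2.1 s, and 6.1 s reported for the output videos.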

Usage

V1 (5B)

python examples/offline_inference/image_to_video/anisora_image_to_video.py \
  --model IndexTeam/AniSora-v1-i2v-diffusers \
  --image input.png \
  --prompt "anime scene, smooth motion" \
  --height 480 \
  --width 720 \
  --num_frames 81 \
  --guidance_scale 5.0 \
  --num_inference_steps 50 \
  --fps 16 \
  --output anisora_v1.mp4

V2/V3 (14B)

python examples/offline_inference/image_to_video/anisora_v2_image_to_video.py \
  --image input.png \
  --prompt "anime scene, high quality animation" \
  --height 480 \
  --width 832 \
  --num-frames 49 \
  --guidance-scale 5.0 \
  --num-inference-steps 30 \
  --fps 8 \
  --output anisora_v2.mp4

Changes

New Files

vllm_omni/diffusion/models/anisora/ - AniSora pipeline module
- pipeline_anisora_i2v_cogvideox.py - V1 (5B) CogVideoX-based pipeline
- pipeline_anisora_v2_i2v.py - V2/V3 (14B) Wan2.1-based pipeline with hybrid loading
- __init__.py - Module exports
examples/offline_inference/image_to_video/anisora_image_to_video.py - V1 CLI example
examples/offline_inference/image_to_video/anisora_v2_image_to_video.py - V2 CLI example

Modified Files

examples/offline_inference/image_to_video/README.md - Added AniSora documentation
vllm_omni/diffusion/registry.py - Register AniSora V1/V2 pipelines and their post-/pre-process hooks

Technical Notes

V2 Hybrid Loading

The V2 pipeline uses a hybrid loading approach because community-converted AniSora weights use different config/naming:

VAE, T5 text encoder, CLIP image encoder loaded from Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
Transformer weights loaded from community AniSora checkpoints
Includes comprehensive key name conversion (AniSora → diffusers format)

Key Name Conversions

Community AniSora weights use different naming conventions:

self_attn → attn1
cross_attn → attn2
ffn → ff
k → to_k, q → to_q, v → to_v, o → to_out.0
modulation → scale_shift_table
And additional mappings for full compatibility
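
The renames listed above can be sketched as a small key-rewriting function. This is an illustration under assumptions, not the PR's converter: it covers only the mappings named in this description, the names `convert_key`, `BLOCK_RENAMES`, and `PROJ_RENAMES` are hypothetical, and it assumes `q`/`k`/`v`/`o` are renamed only when they sit directly under an attention module.

```python
# Module-level renames listed in the PR description.
BLOCK_RENAMES = {
    "self_attn": "attn1",
    "cross_attn": "attn2",
    "ffn": "ff",
    "modulation": "scale_shift_table",
}
# Projection renames, applied only directly under an attention module (assumption).
PROJ_RENAMES = {"q": "to_q", "k": "to_k", "v": "to_v", "o": "to_out.0"}


def convert_key(key: str) -> str:
    """Rewrite one AniSora-style parameter name into diffusers-style naming."""
    converted = []
    parent_is_attention = False
    for part in key.split("."):
        if parent_is_attention and part in PROJ_RENAMES:
            converted.append(PROJ_RENAMES[part])
        else:
            converted.append(BLOCK_RENAMES.get(part, part))
        parent_is_attention = part in ("self_attn", "cross_attn")
    return ".".join(converted)


def convert_state_dict(state_dict: dict) -> dict:
    """Apply convert_key to every entry; the values are left untouched."""
    return {convert_key(name): value for name, value in state_dict.items()}


print(convert_key("blocks.0.self_attn.q.weight"))
print(convert_key("blocks.3.cross_attn.o.bias"))
```

A table-driven mapping like this is easy to unit-test against a list of expected key pairs, which is one of the test cases suggested later in the review.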

Testing

| Model | GPU | Result |
|---|---|---|
| V1 (5B) | RTX 6000 (~24GB) | ✅ Generates valid video with motion |
| V2 (14B) | NVIDIA H200 (~140GB) | ✅ Generates valid video with motion |

Both pipelines produce output with proper animation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81f0eab187

chatgpt-codex-connector · 2026-01-20T23:08:07Z

+    def __init__(
+        self,
+        *,
+        model_path: str = "Disty0/Index-anisora-5B-diffusers",
+        dtype: torch.dtype = torch.bfloat16,


Accept od_config in AniSora pipeline constructor

OmniDiffusion instantiates all registered diffusion models via initialize_model, which always calls model_class(od_config=od_config). This constructor only accepts model_path/dtype/device, so using AniSora through the normal Omni/Diffusers loader path will immediately raise a TypeError for the unexpected od_config kwarg and prevent the model from loading at all.
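
The failure mode described here can be reproduced without any model code. The sketch below uses stand-in classes: `FakeOmniDiffusionConfig`, `LegacyPipeline`, and `FixedPipeline` are hypothetical names, and `initialize_model` only mirrors the `model_class(od_config=od_config)` call shape quoted in the review, not the real loader.

```python
from dataclasses import dataclass


@dataclass
class FakeOmniDiffusionConfig:
    """Hypothetical stand-in for the real config; only `model` is used here."""
    model: str


class LegacyPipeline:
    """Constructor shaped like the one in the reviewed commit."""

    def __init__(self, *, model_path: str = "Disty0/Index-anisora-5B-diffusers", dtype: str = "bfloat16"):
        self.model_path = model_path
        self.dtype = dtype


class FixedPipeline:
    """Constructor that accepts the keyword the loader actually passes."""

    def __init__(self, *, od_config: FakeOmniDiffusionConfig):
        self.model_path = od_config.model


def initialize_model(model_class, od_config):
    # Same call shape as described in the review comment.
    return model_class(od_config=od_config)


cfg = FakeOmniDiffusionConfig(model="Disty0/Index-anisora-5B-diffusers")

try:
    initialize_model(LegacyPipeline, cfg)
except TypeError as exc:
    print("legacy constructor failed:", exc)

print(initialize_model(FixedPipeline, cfg).model_path)
```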

chatgpt-codex-connector · 2026-01-20T23:08:07Z

+    def __call__(
+        self,
+        prompt: str | list[str],
+        image: PIL.Image.Image,


Provide forward(req) entry point for AniSora V2

The diffusion engine executes models via pipeline.forward(req) (with an OmniDiffusionRequest), but this class only defines __call__(prompt, image, ...) and never overrides forward. That means nn.Module.forward will raise NotImplementedError at runtime even if the model loads, so AniSora V2 cannot be run through Omni until a forward wrapper is added.

lishunyang12

Thanks for your contribution. Amazing work; I will review it in the next two days.

lishunyang12 · 2026-01-24T09:52:10Z

I saw you introduced new example files. Is it possible to reuse the script we already have?

lishunyang12

@SamitHuang @ZJY0516 PTAL

lishunyang12 · 2026-01-27T17:15:45Z

Fix conflicts, thanks

dorhuri123 · 2026-01-28T10:11:48Z

Thanks for the review! I rebased the PR on main, resolved conflicts, and pushed the updated branch.

V2 (Wan2.1) updates:

Added a minimal fallback in OmniDiffusion for AniSora V2/V3 repos that don’t ship model_index.json, so _class_name resolves to AniSoraV2I2VPipeline.
Fixed CLIP conditioning in the V2 pipeline to prefer pil_image over preprocessed_image (avoids PIL conversion errors during warmup).
For large-model runs (e.g., AniSora V2), I used PYTORCH_ALLOC_CONF=expandable_segments:True to reduce allocator fragmentation on big GPUs.

V1 (CogVideoX) updates:

Added CogVideoXImageToVideoPipeline alias in the registry for the 5B model.
Added pre/post-process hooks for the CogVideoX pipeline.

Shared updates (V1 + V2):

Implemented required forward() + load_weights() for vLLM integration.
Fixed image preprocessing & device handling and corrected self.dtype usage.
Declared support_image_input=True and ensured model weights are moved to device before inference.

Testing:

V1 (CogVideoX) ran on RTX 6000 S without setting PYTORCH_ALLOC_CONF.
V2 (Wan2.1) ran on H200 (short config) with the allocator env set.

Copilot

Pull request overview

This PR adds Index-AniSora image-to-video support to vLLM-Omni, covering both the CogVideoX-based 5B model and the Wan2.1-based 14B models, and wires them into the Omni diffusion registry and offline inference examples.

Changes:

Extend OmniDiffusion initialization logic to infer AniSora V2/V3 Wan2.1-based pipelines when model_index.json is missing and only config.json is available.
Register new AniSora pipelines (AniSoraI2VCogVideoXPipeline and AniSoraV2I2VPipeline) with corresponding pre-/post-processing hooks and implement their model loading, key-conversion, and I2V sampling logic.
Update image-to-video examples and docs to describe AniSora usage and add the AniSora 5B pipeline to the supported models list.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.

| File | Description |
|---|---|
| `vllm_omni/entrypoints/omni_diffusion.py` | Adds a `FileNotFoundError` guard when `model_index.json` is absent and introduces a special-case fallback that maps AniSora V2/V3 Wan2.1-based model IDs to the `AniSoraV2I2VPipeline`. |
| `vllm_omni/diffusion/registry.py` | Registers `AniSoraV2I2VPipeline` and `AniSoraI2VCogVideoXPipeline` with their pre-/post-process hooks, while also removing the `FluxPipeline` registry entries and the central sequence-parallelism hook. |
| `vllm_omni/diffusion/models/anisora/pipeline_anisora_v2_i2v.py` | Introduces the Wan2.1-based AniSora V2/V3 I2V pipeline with hybrid loading (Wan2.1 base components + AniSora transformer weights, including key-name conversion, VAE-based conditioning, and FlowUniPC sampling). |
| `vllm_omni/diffusion/models/anisora/pipeline_anisora_i2v_cogvideox.py` | Adds a CogVideoX-based AniSora 5B I2V pipeline using diffusers’ native CogVideoX components and implements image encoding, 3D rotary embeddings, and DDIM-based denoising. |
| `vllm_omni/diffusion/models/anisora/__init__.py` | Exposes the two new AniSora pipelines as part of the diffusion models package. |
| `examples/offline_inference/image_to_video/image_to_video.py` | Generalizes the example script description/usage to include AniSora 5B and 14B models alongside existing Wan2.2 I2V/TI2V models. |
| `examples/offline_inference/image_to_video/README.md` | Expands the image-to-video README with dedicated AniSora V1/V2 sections and example commands, plus reorganized Wan2.2 usage notes. |
| `docs/models/supported_models.md` | Updates the supported-models table to include `AniSoraI2VCogVideoXPipeline` and remove some previous rows (e.g., Flux, certain TTS entries). |
| `docs/.nav.yml` | Adds navigation entries for LoRA inference examples and several Omni connector design docs in the user guide and design sections. |

Copilot · 2026-01-28T21:11:37Z

+
+
+# Simple test
+if __name__ == "__main__":
+    import urllib.request
+
+    print("Testing AniSora I2V CogVideoX Pipeline...")
+
+    # Create pipeline
+    pipeline = AniSoraI2VCogVideoXPipeline(
+        model_path="Disty0/Index-anisora-5B-diffusers",
+        dtype=torch.bfloat16,
+    )
+    pipeline.to("cuda")
+
+    # Download test image
+    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
+    urllib.request.urlretrieve(url, "/tmp/cat.png")
+    image = PIL.Image.open("/tmp/cat.png").convert("RGB")
+
+    # Generate
+    output = pipeline(
+        prompt="a cat walking in the garden, high quality",
+        image=image,
+        negative_prompt="low quality, blurry",
+        num_inference_steps=10,
+        height=480,
+        width=832,
+        num_frames=17,
+    )
+
+    print(f"Output type: {type(output)}")
+    print(f"Output.output shape: {output.output.shape}")
+
+    # Check for NaN
+    if torch.isnan(output.output).any():
+        print("WARNING: Output contains NaN!")
+    else:
+        print("Output looks valid (no NaN)")
+
+    # Save video
+    from diffusers.utils import export_to_video
+
+    video = output.output[0].permute(1, 2, 3, 0).cpu().numpy()  # [C, F, H, W] -> [F, H, W, C]
+    video = ((video + 1) / 2 * 255).clip(0, 255).astype("uint8")
+    export_to_video(video, "/workspace/test_cogvideox.mp4", fps=16)
+    print("Video saved to /workspace/test_cogvideox.mp4")


The if __name__ == "__main__" block instantiates AniSoraI2VCogVideoXPipeline(model_path=..., dtype=...), but the class __init__ only takes od_config (plus keyword-only) and doesn’t accept these arguments. This makes the in-file test unusable and may mislead users about how to construct the pipeline; it should either be removed or rewritten to go through OmniDiffusionConfig / the Omni entrypoint.

ZJY0516 · 2026-02-04T08:58:41Z

@dorhuri123 It seems that the first video has accuracy problems

dorhuri123 · 2026-02-04T11:24:47Z

@ZJY0516 agreed, the first output looks clearly off (strong color inversion and desaturation compared to the input). I re-ran the exact same settings and got a cleaner output on my side (after all the changes made to suit the existing example file).

Could you point to the specific behavior you want to treat as the “accuracy issue” (e.g., color inversion, identity drift, motion artifacts)? That would help me isolate whether it’s still a code path issue or just variability.

Same input/settings for both runs:

attempt 1

anisora_v1_demo.mp4

attempt 2

anisora_v1_demo.1.mp4

ZJY0516 · 2026-02-04T11:31:44Z

The shape and movement of the cat in the video don’t look quite right. Could you compare this with the official implementation to verify? @dorhuri123

dorhuri123 · 2026-02-05T08:47:11Z

@ZJY0516 I compared against the official Diffusers CogVideoXImageToVideoPipeline using the same input/settings on an RTX 6000 (Blackwell Server Edition). The output shows the same shape/motion characteristics as the vLLM run, so it looks like this is model behavior rather than an integration issue.

Baseline (official diffusers) commands + script used:

# env
python -m venv ~/anisora-diffusers
source ~/anisora-diffusers/bin/activate
pip install --upgrade pip
pip install diffusers==0.36.0 transformers accelerate safetensors huggingface_hub \
            sentencepiece tiktoken protobuf imageio imageio-ffmpeg
pip install --pre --upgrade torch --index-url https://download.pytorch.org/whl/nightly/cu128

# input image
wget -O /tmp/cat.png https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png

# run_diffusers_anisora_v1.py
import torch
import PIL.Image
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "Disty0/Index-anisora-5B-diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = PIL.Image.open("/tmp/cat.png").convert("RGB")

video = pipe(
    prompt="A cat playing with yarn",
    image=image,
    height=480,
    width=720,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
    output_type="np",
).frames[0]

export_to_video(video, "anisora_v1_diffusers.mp4", fps=16)

I’ll attach the diffusers output video in this comment. If you’re seeing a specific artifact you want addressed, let me know the exact behavior and I’ll dig deeper.

anisora_v1_diffusers.mp4

hsliuustc0106 · 2026-02-11T04:21:40Z

resolve conflicts please

hsliuustc0106 · 2026-02-11T04:24:31Z

Which transformers version are you using?

hsliuustc0106 · 2026-02-11T04:25:31Z

Did you keep the same seed for the comparison with diffusers?

dorhuri123 · 2026-02-11T16:16:08Z

Which transformers version are you using?

I didn't pin the transformers version in that comparison — I ran pip install transformers which installed the latest version compatible with diffusers 0.36.0. I don't have that environment anymore so I can't check the exact version. I can re-create the environment and report back with the exact version if needed.

dorhuri123 · 2026-02-11T16:16:10Z

Did you keep the same seed for the comparison with diffusers?

Good catch — I didn't set a fixed seed in the diffusers baseline script. The comparison was qualitative, showing that the same motion/shape characteristics appear in both implementations. I can re-run with a fixed seed in both (generator=torch.Generator("cuda").manual_seed(42)) if needed.

lishunyang12

Nice work on this — left a few thoughts inline, mostly around some small things I noticed.

lishunyang12 · 2026-02-22T01:55:02Z

 |`StableDiffusion3Pipeline` | Stable-Diffusion-3 | `stabilityai/stable-diffusion-3.5-medium` |
 |`Flux2KleinPipeline` | FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B`, `black-forest-labs/FLUX.2-klein-9B` |
-|`FluxPipeline` | FLUX.1-dev | `black-forest-labs/FLUX.1-dev` |
 |`StableAudioPipeline` | Stable-Audio-Open | `stabilityai/stable-audio-open-1.0` |


Looks like this diff might have accidentally deleted the FluxPipeline and Qwen3TTSForConditionalGeneration rows — probably a rebase artifact? Also, would it make sense to add the AniSora V2 (14B) entry to the table too?

Good catch — this was indeed a rebase artifact. I've synced the GPU table with upstream main (restored FluxPipeline and the three Qwen3TTSForConditionalGeneration rows) and added the AniSora V2 (14B) entry as well.

Makes sense, thanks for cleaning that up.

lishunyang12 · 2026-02-22T01:55:02Z

+        "pipeline_anisora_i2v_cogvideox",
+        "AniSoraI2VCogVideoXPipeline",
+    ),
 }


Just something I was wondering about — registering CogVideoXImageToVideoPipeline as the key means any model declaring that class name would get routed here, including vanilla CogVideoX I2V models. Would a more specific key work better, or is there a reason for keeping it generic?

You're right — using the generic diffusers class name would hijack vanilla CogVideoX models. I've renamed the registry key to AniSoraI2VCogVideoXPipeline and added a targeted mapping in omni_diffusion.py that only converts CogVideoXImageToVideoPipeline → AniSoraI2VCogVideoXPipeline when "anisora" appears in the model name. This way vanilla CogVideoX models are unaffected.

Much better — scoping by model name avoids the hijacking issue.

lishunyang12 · 2026-02-22T01:55:02Z

+        # Load weights from AniSora
+        logger.info("Downloading AniSora weights...")
+        import glob
+        import os as os_module


Minor nit — os is already imported at the top of the file (line 21), so the import os as os_module here shadows it a bit. Not a big deal though.

Fixed — removed the redundant import and switched both usages to the top-level os.

lishunyang12 · 2026-02-22T01:55:02Z

+
+        # Load state dict
+        missing, unexpected = self.transformer.load_state_dict(converted_state_dict, strict=False)
+        if missing:


I noticed missing keys are only logged at debug level, which is off by default. Since a wrong key mapping could be tricky to debug, would it help to log at warning level or add a threshold check? Just a thought.

Good point — a broken key mapping would be very hard to diagnose with debug-level logging. Changed both missing and unexpected keys to warning level, and removed the 10-key threshold so all keys are always logged. This way any mismatch is immediately visible.
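
A standalone sketch of the logging behavior described in this reply, using only the standard library. The helper name `report_key_mismatch` is hypothetical; in the pipeline the two lists would come from `load_state_dict(..., strict=False)`, which this sketch replaces with plain lists.

```python
import logging

logger = logging.getLogger("anisora_key_check")


def report_key_mismatch(log: logging.Logger, missing: list, unexpected: list) -> bool:
    """Log every missing/unexpected key at WARNING level; return True on any mismatch."""
    for label, keys in (("missing", missing), ("unexpected", unexpected)):
        if keys:
            # No truncation threshold: every key is listed so a bad mapping is visible.
            log.warning("%d %s keys: %s", len(keys), label, sorted(keys))
    return bool(missing or unexpected)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    report_key_mismatch(logger, ["blocks.0.attn1.to_q.weight"], ["blocks.0.self_attn.q.weight"])
```

A pair like the one in the usage line (the same parameter showing up once as missing and once as unexpected) is the typical signature of a rename that was not applied.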

lishunyang12 · 2026-02-22T01:55:02Z

+
+            # Classifier-free guidance
+            if do_classifier_free_guidance:
+                noise_uncond = self.transformer(


I noticed CFG here runs the transformer twice per step instead of batching conditional and unconditional together. The V1 pipeline does batch them with torch.cat. Is there a specific reason V2 does it differently, or could it use the same approach?

No specific reason — this was an oversight. I've refactored V2 to batch conditional and unconditional inputs with torch.cat in a single forward pass, matching V1's approach. This halves the number of transformer calls per denoising step.

Nice, batching should cut the per-step cost significantly.
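
To make the difference concrete, here is a toy sketch with a counting stand-in for the transformer. Everything in it is illustrative: `CountingModel` is hypothetical, plain floats replace tensors, and a list replaces the `torch.cat` batch. The guidance combination is the usual classifier-free guidance form, uncond + scale * (cond - uncond).

```python
class CountingModel:
    """Stand-in for the transformer: counts calls and maps each batch item."""

    def __init__(self):
        self.calls = 0

    def __call__(self, batch):
        self.calls += 1
        return [2.0 * x for x in batch]  # placeholder for a noise prediction


def cfg_two_pass(model, cond, uncond, scale):
    """Two forward calls per step, as in the reviewed V2 code."""
    noise_cond = model([cond])[0]
    noise_uncond = model([uncond])[0]
    return noise_uncond + scale * (noise_cond - noise_uncond)


def cfg_batched(model, cond, uncond, scale):
    """One forward call per step on a batch of [uncond, cond]."""
    noise_uncond, noise_cond = model([uncond, cond])
    return noise_uncond + scale * (noise_cond - noise_uncond)


two_pass_model, batched_model = CountingModel(), CountingModel()
a = cfg_two_pass(two_pass_model, cond=1.0, uncond=0.5, scale=5.0)
b = cfg_batched(batched_model, cond=1.0, uncond=0.5, scale=5.0)
print(a == b, two_pass_model.calls, batched_model.calls)
```

Both variants produce the same guided prediction; the batched one issues a single call on a batch twice as large, so the saving comes from fewer launches and better utilization rather than from less arithmetic.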

lishunyang12 · 2026-02-22T01:55:02Z

+            if isinstance(first_prompt, dict):
+                additional_info = first_prompt.get("additional_information", {})
+                if isinstance(additional_info, dict) and isinstance(
+                    additional_info.get("preprocessed_image"), PIL.Image.Image


I might be misreading this, but video_processor.preprocess() returns a torch.Tensor, so the isinstance(..., PIL.Image.Image) check would always be False, making this branch unreachable. Is the intent to always go through multi_modal_data instead?

You're right — preprocessed_image is always a tensor from VideoProcessor.preprocess(), so the PIL check was dead code. I've simplified the logic to go directly to multi_modal_data["image"] which holds the PIL image needed for CLIP conditioning.

lishunyang12 · 2026-02-22T01:55:02Z

+        logger.info("Encoding prompts...")
+        prompt_embeds, negative_prompt_embeds = self.encode_prompt(prompt, negative_prompt)
+
+        do_classifier_free_guidance = guidance_scale > 1.0 and negative_prompt_embeds is not None


Just want to make sure — guidance_scale seems to only take effect when negative_prompt_embeds is provided, but I don't see a default negative prompt being set. Is that intentional, or should there be an empty-string default when guidance_scale > 1.0?

Not intentional — without a default, CFG was silently becoming a no-op when no negative prompt was provided. I've added a default of "" (empty string) in both V1 and V2 when guidance_scale > 1.0 and no negative prompt is given. This matches the behavior of diffusers' official pipelines.

Good fix — matching diffusers' default behavior seems right.
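
The silent no-op and the fix can be sketched in a few lines. The function names below are hypothetical, and the `cfg_enabled` condition is modeled on the quoted line (`guidance_scale > 1.0 and negative_prompt_embeds is not None`), with the prompt string standing in for its embeddings.

```python
from typing import Optional


def resolve_negative_prompt(negative_prompt: Optional[str], guidance_scale: float) -> Optional[str]:
    """Fall back to an empty negative prompt when guidance is requested."""
    if negative_prompt is None and guidance_scale > 1.0:
        return ""
    return negative_prompt


def cfg_enabled(negative_prompt: Optional[str], guidance_scale: float) -> bool:
    """Same shape as the condition in the reviewed code."""
    return guidance_scale > 1.0 and negative_prompt is not None


# Before the fix: guidance_scale=5.0 is ignored because no negative prompt was given.
print(cfg_enabled(None, 5.0))
# After the fix: the empty-string default keeps guidance active.
print(cfg_enabled(resolve_negative_prompt(None, 5.0), 5.0))
```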

lishunyang12 · 2026-02-22T01:55:02Z

+
+# Default paths for components
+DEFAULT_WAN_BASE = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
+DEFAULT_ANISORA_TRANSFORMER = "aardsoul-music/Wan2.1-Anisora-14B"


Small thing — DEFAULT_ANISORA_TRANSFORMER doesn't seem to be used anywhere. Is it planned for future use, or can it be removed?

No plans for it — the transformer path always comes from od_config.model. Removed.

lishunyang12 · 2026-02-22T01:55:02Z

+            model_id = (od_config.model or "").lower()
+            if (
+                od_config.model_class_name is None
+                and "anisora" in model_id


The "anisora" in model_id check might be a bit fragile — someone with a path like /data/anisora_experiment/some-other-model could accidentally match. Would a config-based check be more reliable? Just a thought.

Good point. I've changed both AniSora detection paths to check os.path.basename() of the model path/ID instead of the full string, so only the actual model name is matched. A fully config-based approach would require changes upstream (e.g., a field in config.json), so basename matching is the best we can do for now since these community repos don't ship model_index.json.

Yeah, basename matching sounds like a reasonable middle ground for now.
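
A minimal sketch of the difference between matching on the full string and on the basename. The helper names are hypothetical; the basename expression follows the `os.path.basename((od_config.model or "").rstrip("/")).lower()` form quoted in the automated review further down.

```python
import os


def matches_full_string(model: str) -> bool:
    """Original heuristic: substring match anywhere in the path or ID."""
    return "anisora" in (model or "").lower()


def matches_basename(model: str) -> bool:
    """Revised heuristic: substring match on the final path component only."""
    name = os.path.basename((model or "").rstrip("/"))
    return "anisora" in name.lower()


for candidate in (
    "/data/anisora_experiment/some-other-model",
    "IndexTeam/AniSora-v1-i2v-diffusers",
    "/models/Wan2.1-Anisora-14B/",
):
    print(candidate, matches_full_string(candidate), matches_basename(candidate))
```

The first path is the false positive raised in the review: it matches on the full string but not on the basename. A model whose own directory name contains "anisora" would still match, which is the remaining limitation of the heuristic.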

lishunyang12 · 2026-02-22T01:55:02Z

+
+class AniSoraI2VCogVideoXPipeline(nn.Module):
+    # vLLM uses this flag to decide whether to feed dummy images in warmup
+    support_image_input = True


Minor thing — the class docstring is after support_image_input = True, so Python would attach it to that attribute rather than the class. Might want to move it up?

Fixed — moved the docstring to the first statement in the class body (before support_image_input) in both V1 and V2 pipelines.
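
The effect of docstring placement is easy to confirm in plain Python: a string literal only becomes the class `__doc__` when it is the first statement in the class body. The class names below are illustrative only.

```python
class Misplaced:
    support_image_input = True
    """AniSora pipeline."""  # just an expression statement; not stored as __doc__


class Correct:
    """AniSora pipeline."""

    support_image_input = True


print(Misplaced.__doc__)
print(Correct.__doc__)
```

With the string in second position, `help()` and documentation tools that read `__doc__` see no class docstring at all.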

hsliuustc0106 · 2026-02-22T13:48:16Z

@vllm-omni-reviewer

github-actions · 2026-02-22T13:50:12Z

🤖 VLLM-Omni PR Review

Code Review: Add Index-AniSora I2V Support

1. Overview

This PR adds support for Index-AniSora Image-to-Video models, supporting both the 5B (CogVideoX-based) and 14B (Wan2.1-based) variants. The implementation includes:

Two new pipeline modules with hybrid loading for V2
Registry and entrypoint integration
Documentation and example updates

Overall Assessment: LGTM with suggestions - The implementation is well-structured and follows existing patterns, but has several issues that should be addressed before merging.

2. Code Quality

Positive Aspects

Well-documented code with clear docstrings and comments
Good logging throughout for debugging
Clean separation between V1 and V2 architectures
Proper type hints used consistently

Issues Found

Critical: Incorrect ValueError usage

vllm_omni/diffusion/models/anisora/pipeline_anisora_i2v_cogvideox.py:79-82

raise ValueError(
    """No image is provided. This model requires an image to run.""",
    """Please correctly set `"multi_modal_data": {"image": <an image object or file path>, …}`""",
)

This raises ValueError with two arguments, which creates a tuple exception message. Same issue at lines 85-88 and in pipeline_anisora_v2_i2v.py:77-84.

Fix:

raise ValueError(
    "No image is provided. This model requires an image to run. "
    "Please correctly set `multi_modal_data: {image: <an image object or file path>, …}`"
)
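
A small demonstration of why the comma matters. The messages are shortened for illustration; the behavior shown is standard Python exception formatting, where several positional arguments are rendered as a tuple.

```python
# Two positional arguments: str() renders the args tuple.
two_args = ValueError("No image is provided.", "Please set multi_modal_data.")

# Adjacent string literals (no comma) are concatenated into one argument.
one_arg = ValueError(
    "No image is provided. "
    "Please set multi_modal_data."
)

print(str(two_args))
print(str(one_arg))
```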

Potential Bug: Duplicate weight loading

vllm_omni/diffusion/models/anisora/pipeline_anisora_v2_i2v.py:248-252

def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    """Load weights using AutoWeightsLoader for vLLM integration."""
    loader = AutoWeightsLoader(self)
    return loader.load_weights(weights)

The V2 pipeline loads weights manually in __init__ (lines 196-243), but also provides load_weights. If vLLM calls load_weights after initialization, weights could be loaded twice or incorrectly.

Suggestion: Either remove load_weights for V2 or make __init__ not load weights and rely on vLLM's weight loading mechanism.

3. Architecture & Design

Positive Aspects

Hybrid loading approach for V2 is well-documented and necessary for community weight compatibility
Pre/post-process functions follow existing patterns in the codebase
Key name conversion logic is comprehensive and clearly documented

Concerns

Fragile model detection logic

vllm_omni/entrypoints/omni_diffusion.py:71-77

if (
    class_name == "CogVideoXImageToVideoPipeline"
    and "anisora" in os.path.basename((od_config.model or "").rstrip("/")).lower()
):
    class_name = "AniSoraI2VCogVideoXPipeline"

This substring match on the basename could still match unintended models whose final path component contains "anisora" (e.g., /models/other/anisora_experiment). Consider:

Adding a config file check for AniSora-specific markers
Using a more specific pattern match
Documenting the naming convention requirement

Forced download in V2 pipeline

vllm_omni/diffusion/models/anisora/pipeline_anisora_v2_i2v.py:196-200

if local_anisora:
    weight_path = model_path
else:
    weight_path = snapshot_download(model_path, local_files_only=False)

The local_files_only=False forces network access even when files might be cached. Should respect offline mode:

weight_path = snapshot_download(model_path, local_files_only=local_anisora)

Missing offline mode support for Wan base

vllm_omni/diffusion/models/anisora/pipeline_anisora_v2_i2v.py:167-175

The Wan2.1 base components are loaded with local_files_only=local_wan, but local_wan is determined by checking if the default path exists locally, which will almost always be False for the default HuggingFace ID.

4. Security & Safety

Input Validation

Missing path validation

vllm_omni/diffusion/models/anisora/pipeline_anisora_v2_i2v.py:201-206

safetensor_files = glob.glob(os.path.join(weight_path, "*.safetensors"))
if not safetensor_files:
    safetensor_files = glob.glob(os.path.join(weight_path, "**/*.safetensors"), recursive=True)

The glob patterns could potentially match unintended files. Consider:

Validating that files are within the expected directory
Filtering for specific expected file patterns

Silent failure for missing CLIP encoder

vllm_omni/diffusion/models/anisora/pipeline_anisora_v2_i2v.py:177-184

except Exception as e:
    logger.warning("CLIP image encoder not available: %s", e)
    self.image_processor = None
    self.image_encoder = None
    self.has_image_encoder = False

Catching broad Exception and continuing could mask real issues. The I2V pipeline may produce degraded results without CLIP conditioning. Consider:

Making CLIP encoder required for I2V models
Or raising a clear warning that output quality may be affected

5. Testing & Documentation

Documentation

✅ README updated with new models
✅ Usage examples provided
✅ Demo results with specific settings
✅ Technical notes explain hybrid loading approach

Missing

❌ No unit tests for the new pipelines
❌ No validation tests for key conversion logic
❌ No tests for model detection logic in entrypoint

Suggested Test Cases

Key conversion correctness (compare converted keys against expected mapping)
Model detection logic for various model name patterns
Error handling for missing images
Dimension validation for various input sizes
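The first suggested case could look roughly like the sketch below. The prefix mapping and the `convert_key` helper are placeholders invented for illustration; the real AniSora to diffusers mapping lives in the V2 pipeline and is not reproduced here.

```python
# Hypothetical mapping, for illustration only.
EXAMPLE_PREFIX_MAP = {
    "old_block.": "new_block.",
    "old_head.": "new_head.",
}


def convert_key(key: str, prefix_map: dict) -> str:
    """Rewrite the first matching prefix; leave unknown keys untouched."""
    for old, new in prefix_map.items():
        if key.startswith(old):
            return new + key[len(old):]
    return key


def test_known_prefixes_are_rewritten():
    assert convert_key("old_block.0.weight", EXAMPLE_PREFIX_MAP) == "new_block.0.weight"
    assert convert_key("old_head.bias", EXAMPLE_PREFIX_MAP) == "new_head.bias"


def test_unknown_keys_pass_through():
    assert convert_key("norm.weight", EXAMPLE_PREFIX_MAP) == "norm.weight"


def test_conversion_does_not_collapse_keys():
    keys = ["old_block.0.weight", "old_block.1.weight", "norm.weight"]
    converted = [convert_key(k, EXAMPLE_PREFIX_MAP) for k in keys]
    # Two distinct source keys must never map to the same target key.
    assert len(set(converted)) == len(keys)
```

A real test would import the pipeline's conversion function and compare against a fixture of expected diffusers key names.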

6. Specific Suggestions

`pipeline_anisora_i2v_cogvideox.py`

Line	Issue	Suggestion
79-82	ValueError with multiple args	Combine into single string
85-88	ValueError with multiple args	Combine into single string
368-370	load_weights may conflict with init loading	Document that this is for vLLM internal use only

`pipeline_anisora_v2_i2v.py`

Line	Issue	Suggestion
77-84	ValueError with multiple args	Combine into single string
167-175	Offline mode not properly supported	Add `local_files_only` parameter to config
196-200	Forced network access	Use `local_files_only=local_anisora`
248-252	Duplicate weight loading risk	Remove or document behavior clearly

`omni_diffusion.py`

Line	Issue	Suggestion
71-77	Fragile string matching	Add additional validation or document naming requirements
88-97	Fragile string matching	Same as above

`registry.py`

Line	Issue	Suggestion
289-290	Post-process func names	Consider adding docstrings explaining the func signatures

7. Approval Status

LGTM with suggestions

The PR is well-structured and follows existing patterns in the codebase. The hybrid loading approach for V2 is necessary and well-implemented. However, the following should be addressed:

Required before merge:

1. Fix the ValueError multi-argument issue (affects error messages displayed to users)

Recommended:
2. Fix the local_files_only=False forced download in V2 pipeline
3. Add basic unit tests for key conversion logic
4. Document the model naming convention requirement for auto-detection

Optional improvements:
5. Consider making CLIP encoder required for V2 I2V
6. Add validation for the load_weights vs __init__ weight loading in V2

This review was generated automatically by the VLLM-Omni PR Reviewer Bot using glm-5.

lishunyang12

All previous comments addressed — the CFG batching, registry rename, logging, and docstring fixes all look good. The only remaining heuristic is the basename check in omni_diffusion.py, but that's a reasonable approach for now. LGTM.

dorhuri123 · 2026-02-22T18:41:50Z

Thanks for the thorough review and the LGTM! I really appreciate you taking the time to go through everything.

The @vllm-omni-reviewer bot flagged a few additional items — most were false positives or already covered, but it did catch a real bug: our ValueError calls were passing two string arguments (creating a tuple message) instead of a single concatenated string. Just pushed a fix for that in 57e3eb0.
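The effect described here is easy to reproduce in isolation. The message text and the `height` value below are made up for the demonstration and are not copied from the pipeline.

```python
height = 481  # hypothetical invalid input

# Buggy form: the comma creates two positional arguments, so the exception
# stores a 2-tuple in .args and str() renders that tuple.
buggy = ValueError("height must be divisible by 16, ", f"got {height}")

# Fixed form: adjacent string literals are concatenated into one argument.
fixed = ValueError("height must be divisible by 16, " f"got {height}")

print(str(buggy))  # ('height must be divisible by 16, ', 'got 481')
print(str(fixed))  # height must be divisible by 16, got 481
```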

All feedback addressed — ready for merge whenever you're comfortable!

hsliuustc0106 · 2026-02-23T02:10:47Z

@wtomin PTAL

dorhuri123 · 2026-03-19T17:14:23Z

@wtomin

PR Update: Benchmarks, E2E Tests, SP Support & V2 TP Bug Report

What this PR adds

V1 (5B, CogVideoX-based) I2V pipeline with Ulysses sequence parallelism (SP), tensor parallelism (TP), and FP8 quantization
V2 (14B, Wan2.1-based) I2V pipeline with AniSora→diffusers weight key conversion and TP support
Benchmark script (benchmark_anisora.py) with --warmup, --tp, --sp, --quantization flags
E2E offline tests: single GPU, TP=2, SP=2, FP8, V2 TP=2
E2E online serving tests: full job lifecycle (create → poll → download → delete)

Benchmark Results (2× H100 80GB)

Model	Backend	VRAM (GiB)	Latency (s)
V1 (5B)	diffusers	34.46	107.88
V1 (5B)	vllm-omni (TP=1)	76.86	200.75
V1 (5B)	vllm-omni (TP=2)	122.05	149.25
V1 (5B)	vllm-omni (SP=2)	135.36	131.77
V1 (5B)	vllm-omni (FP8)	65.95	200.92
V2 (14B)	vllm-omni (TP=2)	82.44	274.22

SP=2 cuts latency by ~34% vs TP=1 (131.77s vs 200.75s), roughly a 1.5× speedup
FP8 reduces VRAM by ~14% (76.86 → 65.95 GiB) with no latency impact
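The percentages in the two notes above follow directly from the table; a short check using only the reported numbers:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# V1 latency in seconds: TP=1 vs SP=2.
sp2_latency_cut = percent_reduction(200.75, 131.77)
sp2_speedup = 200.75 / 131.77

# V1 VRAM in GiB: default precision vs FP8, both at TP=1.
fp8_vram_cut = percent_reduction(76.86, 65.95)

print(f"SP=2 latency reduction: {sp2_latency_cut:.1f}%")
print(f"SP=2 speedup factor:    {sp2_speedup:.2f}x")
print(f"FP8 VRAM reduction:     {fp8_vram_cut:.1f}%")
```

Note that the ~34% figure is a reduction in latency; expressed as a throughput ratio, the same numbers give about 1.5×.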

E2E Test Results

Offline inference — 5/5 passed (159s):

test_anisora_v1_offline_single_gpu  PASSED
test_anisora_v1_offline_tp2         PASSED
test_anisora_v2_offline_tp2         PASSED
test_anisora_v1_offline_sp2         PASSED
test_anisora_v1_offline_fp8         PASSED

Online serving — 2/2 passed (116s):

test_anisora_v1_online_create_poll_download_delete  PASSED
test_anisora_v2_online_create_poll_download_delete  PASSED

Known Issue: V2 (Wan2.1 14B) Quality Degradation with TP=2

We discovered a pre-existing bug in the Wan2.1 transformer's TP=2 weight sharding that causes severe mosaic artifacts. This is NOT introduced by this PR — it exists in the base Wan2.1 TP implementation. All 1143/1143 transformer weights load correctly (verified with diagnostics).

V2 with TP=1 (correct output):

v2_sample.1.mp4

V2 with TP=2 (mosaic artifacts):

v2_sample.mp4

This should be tracked as a separate issue for the Wan2.1 transformer TP implementation.

wtomin · 2026-03-23T13:05:02Z

@@ -0,0 +1,197 @@
+# SPDX-License-Identifier: Apache-2.0


Please use RFC #1832 as a reference for your online serving test script. For now, you can test TP only.

Done — added test_anisora_v1_online_tp2_create_poll_download_delete (full job lifecycle with --tensor-parallel-size 2), following your guidance from #1832.

The name of this file has a minor mismatch. Please check #1682 as a reference.

In test-nightly-diffusion.yaml, the online serving tests are launched by:

pytest -s -v tests/e2e/online_serving/test_*_expansion.py

Therefore, I recommend renaming this test script to test_anisora_expansion.py.

Afterwards, I can add a nightly-test label, and launch a buildkite test with this model's online serving test.
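To see why the rename matters, the nightly pattern can be checked against candidate file names. In the real job the shell expands the glob before pytest runs; `fnmatchcase` is used here only as a stand-in for that expansion.

```python
from fnmatch import fnmatchcase

# File-name part of the pattern used in test-nightly-diffusion.yaml.
NIGHTLY_PATTERN = "test_*_expansion.py"


def picked_up_by_nightly(filename: str) -> bool:
    """True if the nightly glob would select this test file."""
    return fnmatchcase(filename, NIGHTLY_PATTERN)


for name in ("test_anisora_online.py", "test_anisora_expansion.py"):
    print(name, picked_up_by_nightly(name))
```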

wtomin · 2026-03-23T13:12:27Z

There is an existing issue about a TP accuracy problem, #1713. Please check whether it is the same problem.

Also, since this model supports FP8, please update docs/user_guide/diffusion_acceleration.md.

dorhuri123 · 2026-03-26T08:50:19Z

@wtomin
Confirmed — same root cause as #1713: unsynchronized RNG across ranks. Without a fixed seed, each rank initializes from independent noise, causing mosaic artifacts. For TP=2 the starting latents diverge across ranks; for SP=2 it's worse — the noise tensor is spatially split across ranks, so independent RNG creates a hard discontinuity at the split boundary that propagates through every denoising step.

Our offline tests pass seed=42 explicitly and all pass correctly. The online path already accepts a seed field in the request payload, which avoids the issue when set. A proper fix would be a sensible default seed in OmniDiffusionSamplingParams as suggested in #1713.
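The failure mode can be illustrated without GPUs. The sketch below is a standard-library simulation, not the vLLM-Omni code path: each "rank" is just a separate `random.Random` instance, and the function names are made up for the example.

```python
import random
from typing import List, Optional


def rank_noise(n: int, seed: Optional[int] = None) -> List[float]:
    """Initial noise as one rank would draw it."""
    # seed=None lets the generator pull fresh entropy from the OS,
    # which is what "unsynchronized RNG" amounts to.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def latents_per_rank(world_size: int, n: int, seed: Optional[int] = None) -> List[List[float]]:
    """Each rank independently builds what should be the same latent tensor."""
    return [rank_noise(n, seed) for _ in range(world_size)]


seeded = latents_per_rank(world_size=2, n=4, seed=42)
unseeded = latents_per_rank(world_size=2, n=4)

print("seeded ranks agree:  ", seeded[0] == seeded[1])
print("unseeded ranks agree:", unseeded[0] == unseeded[1])
```

With a shared seed every rank starts from identical noise; without one, the ranks diverge from the first step, which matches the artifacts described above.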

Also updated docs/user_guide/diffusion_acceleration.md: added AniSora V1 to the VideoGen (Ulysses-SP ✅) and Quantization (FP8 ✅) tables.

wtomin · 2026-03-31T03:52:45Z

+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""
+E2E offline inference tests for Index-AniSora I2V models.
+


To check functionality, we prioritize the online serving test script over the offline inference script. If your test cases overlap between the two test scripts, I recommend keeping the test case (e.g., tp=2) in the online serving test script and deleting it from the offline inference test script. This prevents duplicated test cases.

Removed test_anisora_v1_offline_tp2 and test_anisora_v2_offline_tp2 from the offline test file. TP=2 lifecycle coverage is now maintained only in test_anisora_online.py via test_anisora_v1_online_tp2_create_poll_download_delete as recommended.

wtomin · 2026-03-31T03:54:51Z

Please resolve the conflicts.

dorhuri123 · 2026-04-09T00:01:00Z

Conflicts resolved and rebased on latest main. Picked up upstream's doc restructure (acceleration docs merged into diffusion_features.md) and added AniSora V1/V2 to the new VideoGen feature table there.

dorhuri123 · 2026-04-20T22:58:33Z

@wtomin Rebased on latest main and resolved the conflicts (updated the new diffusion_features.md with AniSora V1/V2 rows and kept the registry merged). Could you take another look when you get a chance? Thanks!

wtomin · 2026-05-14T02:48:01Z

@dorhuri123 Sorry for the delay. Could you rebase onto the latest main? I will try to merge this PR soon.

…nchmarks
- Add AniSora V1 (5B, CogVideoX-based) I2V pipeline with Ulysses SP, TP, and FP8 quantization support
- Add AniSora V2 (14B, Wan2.1-based) I2V pipeline with AniSora→diffusers weight key conversion and TP support
- Register both pipelines and their pre/post-process hooks in the diffusion registry; route via OmniDiffusion entrypoint
- Add e2e offline tests (single GPU, SP=2, FP8) and online serving tests (V1 single GPU, V1 TP=2, V2) covering full job lifecycle
- Add AniSora rows to `docs/models/supported_models.md` and the new VideoGen feature table in `docs/user_guide/diffusion_features.md`
Signed-off-by: Dor Huri <Dorhuri123@gmail.com>

dorhuri123 · 2026-05-24T16:55:22Z

@wtomin Thanks! Rebased onto latest main and resolved the conflicts, now a single clean commit, ready to merge.

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

dorhuri123 requested a review from hsliuustc0106 as a code owner January 20, 2026 22:58

chatgpt-codex-connector Bot reviewed Jan 20, 2026

dorhuri123 force-pushed the feature/index-anisora branch from d1c2809 to 537f736 Compare January 20, 2026 23:56

lishunyang12 reviewed Jan 21, 2026

lishunyang12 reviewed Jan 27, 2026

dorhuri123 force-pushed the feature/index-anisora branch from 1c579ed to b1182be Compare January 28, 2026 09:19

hsliuustc0106 requested a review from Copilot January 28, 2026 21:02

Copilot started reviewing on behalf of hsliuustc0106 January 28, 2026 21:02

Copilot AI reviewed Jan 28, 2026

dorhuri123 requested a review from lishunyang12 February 5, 2026 08:52

lishunyang12 reviewed Feb 22, 2026

lishunyang12 approved these changes Feb 22, 2026

hsliuustc0106 added the `ready` label (label to trigger buildkite CI) Feb 23, 2026

dorhuri123 force-pushed the feature/index-anisora branch from beaf8af to 7e9fd7d Compare March 3, 2026 12:55

wtomin reviewed Mar 23, 2026

Comment thread examples/offline_inference/image_to_video/benchmark_anisora.py Outdated

Gaohan123 removed this from the v0.18.0 milestone Mar 23, 2026

dorhuri123 force-pushed the feature/index-anisora branch from 8402714 to 241d9c6 Compare March 26, 2026 08:41

wtomin mentioned this pull request Mar 26, 2026

[RFC]: vLLM-Omni Diffusion Module — Q2 2026 Roadmap #2226

Open

25 tasks

wtomin reviewed Mar 31, 2026

dorhuri123 force-pushed the feature/index-anisora branch 2 times, most recently from 95f995e to aeace8e Compare April 8, 2026 23:59

dorhuri123 force-pushed the feature/index-anisora branch from aeace8e to 1edfb06 Compare April 20, 2026 21:40

dorhuri123 force-pushed the feature/index-anisora branch from 1edfb06 to 768ad3a Compare April 28, 2026 16:38

dorhuri123 force-pushed the feature/index-anisora branch from 768ad3a to 043d1b0 Compare May 24, 2026 16:49

dorhuri123 requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, princepride and yenuo26 as code owners May 24, 2026 16:49

wtomin reviewed May 28, 2026

Comment thread tests/e2e/offline_inference/test_anisora_i2v.py Outdated

Apply suggestion from @wtomin

377c78f

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

